feat(aci): Database-tracked coordinated task-based DetectorGroup backfill #102371

kcons · 2025-10-29T22:39:43Z

This approach lets the backfill run to completion and be tracked without creating a significant task backlog.
The first phase is a one-time creation of status rows, run in a task loop.
From there, we can start triggering the coordinator periodically and have it manage scheduling tasks for work items, very slowly at first to verify, then in increasing volume.
We can ensure that the capacity cost of this backfill stays relatively fixed regardless of processing rate, and failed tasks are naturally rescheduled without starving out tasks that could succeed.
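
As a rough illustration of the coordinator idea, here is a minimal in-memory sketch. It is not the code in this PR: the status values, the `StatusRow` class, the `coordinate` function, and the `max_in_flight` and `stale_after` parameters are all hypothetical names chosen for the example. It shows a cap on in-flight work and oldest-first selection, so a row that keeps failing goes to the back of the queue instead of blocking others.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Hypothetical status values; the PR only shows a varchar(20) "status" column.
NOT_STARTED, IN_PROGRESS, COMPLETED = "not_started", "in_progress", "completed"


@dataclass
class StatusRow:
    detector_id: int
    status: str
    date_updated: datetime


def coordinate(rows, now, max_in_flight=2, stale_after=timedelta(hours=1)):
    """Return the detector ids to schedule on this coordinator run."""
    # Rows stuck in progress past the timeout are treated as failed and reset.
    for row in rows:
        if row.status == IN_PROGRESS and now - row.date_updated > stale_after:
            row.status = NOT_STARTED
            row.date_updated = now  # a retried row moves to the back of the queue

    # Only schedule enough work to fill the remaining concurrency budget.
    in_flight = sum(1 for row in rows if row.status == IN_PROGRESS)
    budget = max(0, max_in_flight - in_flight)

    # Oldest-updated first, so repeatedly failing rows cannot starve the rest.
    pending = sorted(
        (row for row in rows if row.status == NOT_STARTED),
        key=lambda row: row.date_updated,
    )
    scheduled = []
    for row in pending[:budget]:
        row.status = IN_PROGRESS
        row.date_updated = now
        scheduled.append(row.detector_id)
    return scheduled
```

Raising `max_in_flight` between runs is the only knob needed to move from the slow verification phase to higher volume in this sketch.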

This PR is structured as a framework plus a job that uses it, even though we may never actually need to reuse the framework. The split makes it easier to review the job management and the error backfill elements separately; dropping the separation would mean less code, but ultimately a larger conceptual chunk. Separating them also gives us the option of doing more bulk processing of this sort.

Process-wise, we'd:

1. Add a job to create the bulk job status chunks, run it, and validate.
2. Add a cron trigger for the coordinator task, targeting at most 1 run at a time.
3. Verify that the low-rate processing is working as intended, then increase the concurrent job count gradually.
4. Depending on burn-down rate and timeline, tweak the job count to something sustainable that won't backlog our cluster, and set up a dashboard to show how soon we'll be done.
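
For the "how soon we'll be done" part of the last step, a linear burn-down estimate is probably enough. The helper below is a sketch under that assumption; `estimate_completion` and its parameters are hypothetical and not part of this PR. It extrapolates from the number of items completed in a recent window.

```python
from datetime import datetime, timedelta, timezone


def estimate_completion(remaining, completed_in_window, window, now):
    """Estimate when the backfill finishes, or None if no progress was observed."""
    if remaining == 0:
        return now
    if completed_in_window <= 0:
        return None  # no burn-down yet, so any ETA would be meaningless
    # Linear extrapolation: time left = remaining items / observed rate.
    seconds_left = remaining * window.total_seconds() / completed_in_window
    return now + timedelta(seconds=seconds_left)
```

For example, 7200 remaining items at 600 completed per hour gives an estimate 12 hours out; the estimate shifts as the job count is tuned.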

Once done, we can delete and drop the table, or we can leave it for reuse.

src/sentry/workflow_engine/processors/backfill.py

github-actions · 2025-10-29T22:47:32Z

This PR has a migration; here is the generated SQL for src/sentry/workflow_engine/migrations/0094_add_error_backfill_status.py

for 0094_add_error_backfill_status in workflow_engine

--
-- Create model ErrorBackfillStatus
--
CREATE TABLE "workflow_engine_error_backfill_status" ("id" bigint NOT NULL PRIMARY KEY GENERATED BY DEFAULT AS IDENTITY, "date_updated" timestamp with time zone NOT NULL, "date_added" timestamp with time zone NOT NULL, "status" varchar(20) NOT NULL, "detector_id" bigint NOT NULL UNIQUE);
ALTER TABLE "workflow_engine_error_backfill_status" ADD CONSTRAINT "workflow_engine_erro_detector_id_6e5eb8d9_fk_workflow_" FOREIGN KEY ("detector_id") REFERENCES "workflow_engine_detector" ("id") DEFERRABLE INITIALLY DEFERRED NOT VALID;
ALTER TABLE "workflow_engine_error_backfill_status" VALIDATE CONSTRAINT "workflow_engine_erro_detector_id_6e5eb8d9_fk_workflow_";
CREATE INDEX CONCURRENTLY "workflow_engine_error_backfill_status_status_3d9773bb" ON "workflow_engine_error_backfill_status" ("status");
CREATE INDEX CONCURRENTLY "workflow_engine_error_backfill_status_status_3d9773bb_like" ON "workflow_engine_error_backfill_status" ("status" varchar_pattern_ops);
CREATE INDEX CONCURRENTLY "errbkfl_stat_upd_idx" ON "workflow_engine_error_backfill_status" ("status", "date_updated");
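
The composite index on ("status", "date_updated") suggests lookups that filter by status and order by last update. The sketch below illustrates that query shape against a simplified SQLite stand-in for the table. The shortened table name, the `next_batch` helper, and the `not_started` and `completed` status values are assumptions for the example, not taken from the PR.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified SQLite stand-in for the Postgres table above (no FK, text timestamps).
conn.executescript(
    """
    CREATE TABLE error_backfill_status (
        id INTEGER PRIMARY KEY,
        date_updated TEXT NOT NULL,
        date_added TEXT NOT NULL,
        status VARCHAR(20) NOT NULL,
        detector_id INTEGER NOT NULL UNIQUE
    );
    CREATE INDEX errbkfl_stat_upd_idx
        ON error_backfill_status (status, date_updated);
    """
)
# Hypothetical sample rows: (date_updated, status, detector_id).
rows = [
    ("2025-10-29T03:00:00Z", "not_started", 11),
    ("2025-10-29T01:00:00Z", "not_started", 12),
    ("2025-10-29T02:00:00Z", "completed", 13),
]
conn.executemany(
    "INSERT INTO error_backfill_status (date_updated, date_added, status, detector_id)"
    " VALUES (?, ?, ?, ?)",
    [(updated, updated, status, detector_id) for updated, status, detector_id in rows],
)


def next_batch(connection, status, limit):
    """Return detector ids with the given status, least recently updated first."""
    cursor = connection.execute(
        "SELECT detector_id FROM error_backfill_status"
        " WHERE status = ? ORDER BY date_updated LIMIT ?",
        (status, limit),
    )
    return [detector_id for (detector_id,) in cursor]
```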

getsantry · 2025-11-20T08:00:11Z

This issue has gone three weeks without activity. In another week, I will close it.

But! If you comment or otherwise update it, I will reset the clock, and if you remove the label Waiting for: Community, I will leave it alone ... forever!

"A weed is but an unloved flower." ― Ella Wheeler Wilcox 🥀

github-actions bot added the Scope: Backend label Oct 29, 2025

semgrep-code-getsentry bot reviewed Oct 29, 2025

src/sentry/workflow_engine/processors/backfill.py (outdated)

getsantry bot added the Stale label Nov 20, 2025

kcons added 5 commits November 21, 2025 10:46

feat(aci): Database-tracked coordinated task-based DetectorGroup back…

36bbf26

…fill

rebase

a18ffdc

rebase

39e35be

general

22f62f2

more

af4e036

kcons force-pushed the kcons/beefybackfill branch from 49c2c5b to af4e036 Compare November 25, 2025 19:04

vercel bot deployed to Preview November 25, 2025 19:06 View deployment

tweak

193c0c1

vercel bot deployed to Preview November 25, 2025 19:38 View deployment

cleanup

87af63b

vercel bot deployed to Preview November 25, 2025 21:52 View deployment

getsantry bot removed the Stale label Nov 26, 2025
